metamorphic testing
Metamorphic Testing of Multimodal Human Trajectory Prediction
Spieker, Helge, Lazaar, Nadjib, Gotlieb, Arnaud, Belmecheri, Nassim
Context: Predicting human trajectories is crucial for the safety and reliability of autonomous systems, such as automated vehicles and mobile robots. However, rigorously testing the underlying multimodal Human Trajectory Prediction (HTP) models, which typically use multiple input sources (e.g., trajectory history and environment maps) and produce stochastic outputs (multiple possible future paths), presents significant challenges. The primary difficulty lies in the absence of a definitive test oracle, as numerous future trajectories might be plausible for any given scenario. Objectives: This research presents the application of Metamorphic Testing (MT) as a systematic methodology for testing multimodal HTP systems. We address the oracle problem through metamorphic relations (MRs) adapted for the complexities and stochastic nature of HTP. Methods: We present five MRs, targeting transformations of both historical trajectory data and semantic segmentation maps used as an environmental context. These MRs encompass: 1) label-preserving geometric transformations (mirroring, rotation, rescaling) applied to both trajectory and map inputs, where outputs are expected to transform correspondingly. 2) Map-altering transformations (changing semantic class labels, introducing obstacles) with predictable changes in trajectory distributions. We propose probabilistic violation criteria based on distance metrics between probability distributions, such as the Wasserstein or Hellinger distance. Conclusion: This study introduces tool, a MT framework for the oracle-less testing of multimodal, stochastic HTP systems. It allows for assessment of model robustness against input transformations and contextual changes without reliance on ground-truth trajectories.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (7 more...)
- Information Technology (0.69)
- Transportation > Ground > Road (0.68)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models
de Oliveira, Matheus Vinicius da Silva, Silva, Jonathan de Andrade, Fontao, Awdren de Lima
Large Language Models (LLMs) are widely used across multiple domains but continue to raise concerns regarding security and fairness. Beyond known attack vectors such as data poisoning and prompt injection, LLMs are also vulnerable to fairness bugs. These refer to unintended behaviors influenced by sensitive demographic cues (e.g., race or sexual orientation) that should not affect outcomes. Another key issue is hallucination, where models generate plausible yet false information. Retrieval-Augmented Generation (RAG) has emerged as a strategy to mitigate hallucinations by combining external retrieval with text generation. However, its adoption raises new fairness concerns, as the retrieved content itself may surface or amplify bias. This study conducts fairness testing through metamorphic testing (MT), introducing controlled demographic perturbations in prompts to assess fairness in sentiment analysis performed by three Small Language Models (SLMs) hosted on HuggingFace (Llama-3.2-3B-Instruct, Mistral-7B-Instruct-v0.3, and Llama-3.1-Nemotron-8B), each integrated into a RAG pipeline. Results show that minor demographic variations can break up to one third of metamorphic relations (MRs). A detailed analysis of these failures reveals a consistent bias hierarchy, with perturbations involving racial cues being the predominant cause of the violations. In addition to offering a comparative evaluation, this work reinforces that the retrieval component in RAG must be carefully curated to prevent bias amplification. The findings serve as a practical alert for developers, testers and small organizations aiming to adopt accessible SLMs without compromising fairness or reliability.
An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software
Gogani-Khiabani, Sina, Trivedi, Ashutosh, Saha, Diptikalyan, Tizpaz-Niari, Saeid
Large language models (LLMs) show promise for translating natural-language statutes into executable logic, but reliability in legally critical settings remains challenging due to ambiguity and hallucinations. We present an agentic approach for developing legal-critical software, using U.S. federal tax preparation as a case study. The key challenge is test-case generation under the oracle problem, where correct outputs require interpreting law. Building on metamorphic testing, we introduce higher-order metamorphic relations that compare system outputs across structured shifts among similar individuals. Because authoring such relations is tedious and error-prone, we use an LLM-driven, role-based framework to automate test generation and code synthesis. We implement a multi-agent system that translates tax code into executable software and incorporates a metamorphic-testing agent that searches for counterexamples. In experiments, our framework using a smaller model (GPT-4o-mini) achieves a worst-case pass rate of 45%, outperforming frontier models (GPT-4o and Claude 3.5, 9-15%) on complex tax-code tasks. These results support agentic LLM methodologies as a path to robust, trustworthy legal-critical software from natural-language specifications.
- North America > United States > Colorado > Boulder County > Boulder (0.14)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (3 more...)
- Law > Taxation Law (1.00)
- Government > Tax (1.00)
- Government > Regional Government > North America Government > United States Government (0.48)
Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning
Racharak, Teeradaj, Ragkhitwetsagul, Chaiyong, Sontesadisai, Chommakorn, Sunetnanta, Thanwadee
In-context learning (ICL) has emerged as a powerful capability of large language models (LLMs), enabling them to perform new tasks based on a few provided examples without explicit fine-tuning. Despite their impressive adaptability, these models remain vulnerable to subtle adversarial perturbations and exhibit unpredictable behavior when faced with linguistic variations. Inspired by software testing principles, we introduce a software testing-inspired framework, called MMT4NL, for evaluating the trustworthiness of in-context learning by utilizing adversarial perturbations and software testing techniques. It includes diverse evaluation aspects of linguistic capabilities for testing the ICL capabilities of LLMs. MMT4NL is built around the idea of crafting metamorphic adversarial examples from a test set in order to quantify and pinpoint bugs in the designed prompts of ICL. Our philosophy is to treat any LLM as software and validate its functionalities just like testing the software. Finally, we demonstrate applications of MMT4NL on the sentiment analysis and question-answering tasks. Our experiments could reveal various linguistic bugs in state-of-the-art LLMs.
- Asia > Pakistan > Sindh > Karachi Division > Karachi (0.04)
- Asia > Japan > Honshū > Tōhoku (0.04)
- North America > United States > Texas (0.04)
- (6 more...)
Metamorphic Testing of Deep Code Models: A Systematic Literature Review
Asgari, Ali, de Koning, Milan, Derakhshanfar, Pouria, Panichella, Annibale
Large language models and deep learning models designed for code intelligence have revolutionized the software engineering field due to their ability to perform various code-related tasks. These models can process source code and software artifacts with high accuracy in tasks such as code completion, defect detection, and code summarization; therefore, they can potentially become an integral part of modern software engineering practices. Despite these capabilities, robustness remains a critical quality attribute for deep-code models as they may produce different results under varied and adversarial conditions (e.g., variable renaming). Metamorphic testing has become a widely used approach to evaluate models' robustness by applying semantic-preserving transformations to input programs and analyzing the stability of model outputs. While prior research has explored testing deep learning models, this systematic literature review focuses specifically on metamorphic testing for deep code models. By studying 45 primary papers, we analyze the transformations, techniques, and evaluation methods used to assess robustness. Our review summarizes the current landscape, identifying frequently evaluated models, programming tasks, datasets, target languages, and evaluation metrics, and highlights key challenges and future directions for advancing the field.
- Europe > Netherlands > South Holland > Delft (0.40)
- Oceania > Australia > Victoria > Melbourne (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- (5 more...)
- Overview (1.00)
- Research Report > New Finding (0.67)
ASSURE: Metamorphic Testing for AI-powered Browser Extensions
Gao, Xuanqi, Zhai, Juan, Ma, Shiqing, Xie, Siyi, Shen, Chao
The integration of Large Language Models (LLMs) into browser extensions has revolutionized web browsing, enabling sophisticated functionalities like content summarization, intelligent translation, and context-aware writing assistance. However, these AI-powered extensions introduce unprecedented challenges in testing and reliability assurance. Traditional browser extension testing approaches fail to address the non-deterministic behavior, context-sensitivity, and complex web environment integration inherent to LLM-powered extensions. Similarly, existing LLM testing methodologies operate in isolation from browser-specific contexts, creating a critical gap in effective evaluation frameworks. To bridge this gap, we present ASSURE, a modular automated testing framework specifically designed for AI-powered browser extensions. ASSURE comprises three principal components: (1) a modular test case generation engine that supports plugin-based extension of testing scenarios, (2) an automated execution framework that orchestrates the complex interactions between web content, extension processing, and AI model behavior, and (3) a configurable validation pipeline that systematically evaluates behavioral consistency and security invariants rather than relying on exact output matching. Our evaluation across six widely-used AI browser extensions demonstrates ASSURE's effectiveness, identifying 531 distinct issues spanning security vulnerabilities, metamorphic relation violations, and content alignment problems. ASSURE achieves 6.4x improved testing throughput compared to manual approaches, detecting critical security vulnerabilities within 12.4 minutes on average. This efficiency makes ASSURE practical for integration into development pipelines, offering a comprehensive solution to the unique challenges of testing AI-powered browser extensions.
- North America > United States > District of Columbia > Washington (0.05)
- Asia > China > Shaanxi Province > Xi'an (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (5 more...)
AI-Augmented Metamorphic Testing for Comprehensive Validation of Autonomous Vehicles
Zhang, Tony, Kantarci, Burak, Siddique, Umair
Self-driving cars have the potential to revolutionize transportation, but ensuring their safety remains a significant challenge. These systems must navigate a variety of unexpected scenarios on the road, and their complexity poses substantial difficulties for thorough testing. Conventional testing methodologies face critical limitations, including the oracle problem determining whether the systems behavior is correct and the inability to exhaustively recreate a range of situations a self-driving car may encounter. While Metamorphic Testing (MT) offers a partial solution to these challenges, its application is often limited by simplistic modifications to test scenarios. In this position paper, we propose enhancing MT by integrating AI-driven image generation tools, such as Stable Diffusion, to improve testing methodologies. These tools can generate nuanced variations of driving scenarios within the operational design domain (ODD)for example, altering weather conditions, modifying environmental elements, or adjusting lane markings while preserving the critical features necessary for system evaluation. This approach enables reproducible testing, efficient reuse of test criteria, and comprehensive evaluation of a self-driving systems performance across diverse scenarios, thereby addressing key gaps in current testing practices.
- North America > Canada > Ontario > National Capital Region > Ottawa (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > Austria > Vienna (0.04)
- Automobiles & Trucks (1.00)
- Transportation > Ground > Road (0.70)
- Information Technology > Robotics & Automation (0.70)
- Transportation > Passenger (0.54)
Metamorphic Testing for Pose Estimation Systems
Duran, Matias, Laurent, Thomas, Rushe, Ellen, Ventresque, Anthony
Pose estimation systems are used in a variety of fields, from sports analytics to livestock care. Given their potential impact, it is paramount to systematically test their behaviour and potential for failure. This is a complex task due to the oracle problem and the high cost of manual labelling necessary to build ground truth keypoints. This problem is exacerbated by the fact that different applications require systems to focus on different subjects (e.g., human versus animal) or landmarks (e.g., only extremities versus whole body and face), which makes labelled test data rarely reusable. To combat these problems we propose MET-POSE, a metamorphic testing framework for pose estimation systems that bypasses the need for manual annotation while assessing the performance of these systems under different circumstances. MET-POSE thus allows users of pose estimation systems to assess the systems in conditions that more closely relate to their application without having to label an ad-hoc test dataset or rely only on available datasets, which may not be adapted to their application domain. While we define MET-POSE in general terms, we also present a non-exhaustive list of metamorphic rules that represent common challenges in computer vision applications, as well as a specific way to evaluate these rules. We then experimentally show the effectiveness of MET-POSE by applying it to Mediapipe Holistic, a state of the art human pose estimation system, with the FLIC and PHOENIX datasets. With these experiments, we outline numerous ways in which the outputs of MET-POSE can uncover faults in pose estimation systems at a similar or higher rate than classic testing using hand labelled data, and show that users can tailor the rule set they use to the faults and level of accuracy relevant to their application.
Evaluating Human Trajectory Prediction with Metamorphic Testing
Spieker, Helge, Belmecheri, Nassim, Gotlieb, Arnaud, Lazaar, Nadjib
The prediction of human trajectories is important for planning in autonomous systems that act in the real world, e.g. automated driving or mobile robots. Human trajectory prediction is a noisy process, and no prediction does precisely match any future trajectory. It is therefore approached as a stochastic problem, where the goal is to minimise the error between the true and the predicted trajectory. In this work, we explore the application of metamorphic testing for human trajectory prediction. Metamorphic testing is designed to handle unclear or missing test oracles. It is well-designed for human trajectory prediction, where there is no clear criterion of correct or incorrect human behaviour. Metamorphic relations rely on transformations over source test cases and exploit invariants. A setting well-designed for human trajectory prediction where there are many symmetries of expected human behaviour under variations of the input, e.g. mirroring and rescaling of the input data. We discuss how metamorphic testing can be applied to stochastic human trajectory prediction and introduce the Wasserstein Violation Criterion to statistically assess whether a follow-up test case violates a label-preserving metamorphic relation.
- Europe > Austria > Vienna (0.16)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- (4 more...)
- Transportation > Ground > Road (0.35)
- Information Technology > Robotics & Automation (0.35)
- Automobiles & Trucks (0.35)
Validating LLM-Generated Programs with Metamorphic Prompt Testing
The latest paradigm shift in software development brings in the innovation and automation afforded by Large Language Models (LLMs), showcased by Generative Pre-trained Transformer (GPT), which has shown remarkable capacity to generate code autonomously, significantly reducing the manual effort required for various programming tasks. Although, the potential benefits of LLM-generated code are vast, most notably in efficiency and rapid prototyping, as LLMs become increasingly integrated into the software development lifecycle and hence the supply chain, complex and multifaceted challenges arise as the code generated from these language models carry profound questions on quality and correctness. Research is required to comprehensively explore these critical concerns surrounding LLM-generated code. In this paper, we propose a novel solution called metamorphic prompt testing to address these challenges. Our intuitive observation is that intrinsic consistency always exists among correct code pieces but may not exist among flawed code pieces, so we can detect flaws in the code by detecting inconsistencies. Therefore, we can vary a given prompt to multiple prompts with paraphrasing, and to ask the LLM to acquire multiple versions of generated code, so that we can validate whether the semantic relations still hold in the acquired code through cross-validation. Our evaluation on HumanEval shows that metamorphic prompt testing is able to detect 75 percent of the erroneous programs generated by GPT-4, with a false positive rate of 8.6 percent.
- Research Report > Promising Solution (0.66)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)